- Study the molecular basis of variation in development and disease
- Using high-throughput experimental methods
September 21, 2015
build a whole human genome sequencing device and use it to sequence 100 human genomes within 30 days or less, with an accuracy of no more than one error in every 1,000,000 bases sequenced, with an accuracy rate of at least 98% of the genome, and at a recurring cost of no more than $1,000 (US) per genome.
“genome sequencing technology is plummeting in cost and increasing in speed independent of our competition”
“companies can do this for less than $5,000 per genome, in a few days or less — and are moving quickly towards the goals we set for the prize.”
NHGRI strategic plan
"The major bottleneck in genome sequencing is no longer data generation—the computational challenges around data analysis, display and integration are now rate limiting. New approaches and methods are required to meet these challenges."
What makes them different?
Much human variation is due to differences in ~6 million DNA base pairs (0.1% of the genome)
The same genome is expressed differently at different stages and in different tissues.
DNA is packed, making parts inaccessible, and this packing is dynamic!
DNA methylation, a chemical modification of DNA, regulates gene expression.
Measuring DNA methylation and understanding its role in expression regulation in solid tumors
Large blocks of hypo-methylation (sometimes Mbp long) in colon cancer
Genes with hyper-variable expression in colon cancer are enriched within these blocks.
Genes with consistent hyper-variable expression across tumors are tissue-specific.
Motivated by observations that gene expression hyper-variability is enriched in specific regions of epigenetic alteration in colon cancer…
…and that consistent hyper-variability across tumor types is enriched in genes involved in tissue specificity
Anti-profile score: measures sample-specific deviation from normal expression in consistently hyper-variable genes
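The idea behind the score can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the published implementation: the function names, the variance-ratio gene selection, the median ± k·MAD definition of the "normal region", and the value of `k` are all assumptions made for the example.

```r
# Simulated data: rows are genes, columns are samples.
set.seed(1)
n_genes <- 200
normal <- matrix(rnorm(n_genes * 20), nrow = n_genes)
# In tumors, the first 50 genes are made hyper-variable (sd = 4 instead of 1).
tumor_sd <- c(rep(4, 50), rep(1, n_genes - 50))
tumor <- matrix(rnorm(n_genes * 20, sd = tumor_sd), nrow = n_genes)

# Assumption: pick genes by the tumor/normal variance ratio.
select_hypervariable <- function(normal, tumor, n_top = 50) {
  ratio <- apply(tumor, 1, var) / apply(normal, 1, var)
  order(ratio, decreasing = TRUE)[seq_len(n_top)]
}

# Assumption: the "normal region" of a gene is median +/- k * MAD of normal samples.
normal_region <- function(normal, k = 5) {
  med <- apply(normal, 1, median)
  spread <- apply(normal, 1, mad)
  list(lower = med - k * spread, upper = med + k * spread)
}

# Score of a sample = number of selected genes that fall outside the normal region.
anti_profile_score <- function(x, genes, region) {
  x <- x[genes, , drop = FALSE]
  outside <- x < region$lower[genes] | x > region$upper[genes]
  colSums(outside)
}

genes <- select_hypervariable(normal, tumor)
region <- normal_region(normal)
tumor_scores <- anti_profile_score(tumor, genes, region)
normal_scores <- anti_profile_score(normal, genes, region)
```

In this simulation tumor samples score higher than normal samples, because the score counts departures from normal expression rather than modeling a tumor-specific profile.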
Good cross-experiment properties
Stability in normal expression across experiments
Prediction in leave-one-tissue out experiment
Anti-profile score distinguishes between stages in tumor progression
DNA methylation anti-profile score distinguishes between stages in tumor progression
Stratification based on anti-profile score
Stratification of breast samples based on anti-profile score
Support Vector Machines for Anomaly Detection: determine if observations belong to a given group or are anomalies.
Learn functions in the space spanned by the representers of normal samples
\[ f(x) = \sum_i c_i k(x, z_i) + d \]
where \(z_i\) are normal observations.
Estimated, as in a regular SVM, as the solution to the optimization problem
\[ \min_{c,d} \sum_j (1-y_jf_j)_+ + c'\tilde{K}c \]
with \(f_j = \sum_i c_i k(x_j,z_i) + d\), and
\(\tilde{K}=K_s K_n^{-1} K_s\)
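To make the notation concrete, the following base R sketch evaluates \(f(x_j) = \sum_i c_i k(x_j, z_i) + d\) and the hinge-loss term \(\sum_j (1-y_jf_j)_+\) for given coefficients. The Gaussian kernel, `gamma`, and the coefficient values are assumptions for illustration; nothing is fitted here. The penalty term \(c'\tilde{K}c\) is left out because the slide does not define \(K_s\) and \(K_n\).

```r
# Gaussian (RBF) kernel matrix between rows of X (n x p) and rows of Z (m x p).
# The kernel choice and gamma are assumptions; the slides do not name a kernel.
rbf_kernel <- function(X, Z, gamma = 0.5) {
  sq_dist <- outer(rowSums(X^2), rowSums(Z^2), "+") - 2 * X %*% t(Z)
  exp(-gamma * pmax(sq_dist, 0))
}

# f(x_j) = sum_i c_i k(x_j, z_i) + d, evaluated for every row x_j of X.
decision_values <- function(X, Z, coefs, offset, gamma = 0.5) {
  as.vector(rbf_kernel(X, Z, gamma) %*% coefs) + offset
}

# Hinge-loss part of the objective, with labels y_j in {-1, +1}.
hinge_loss <- function(y, f) sum(pmax(0, 1 - y * f))

set.seed(7)
Z <- matrix(rnorm(10), ncol = 2)   # 5 normal samples z_i in 2 dimensions
X <- rbind(Z[1, ], c(20, 20))      # one normal-like point and one far from all z_i
coefs <- rep(0.5, nrow(Z))         # made-up coefficients c_i, not fitted
offset <- -0.25                    # made-up offset d
f <- decision_values(X, Z, coefs, offset)
loss <- hinge_loss(c(1, -1), f)    # labels: +1 for normal, -1 for anomaly
```

Because every basis function is centered on a normal sample, a point far from all \(z_i\) gets kernel values near zero and its decision value falls back to the offset \(d\).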
Prediction of high vs. low relapse risk in lung cancer
Prediction of suspect vs. pathological fetal CTG data (not genomics)
Bsmooth, minfi) Florin Chelaru, UMD
Chelaru et al. (2014) Nature Methods.
Using the epivizr package
epivizr session
    library(epivizr)
    mgr <- startEpiviz(workspace="qyOTB6vVnff")
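    library(SummarizedExperiment)  # assay(), rowRanges()
    library(matrixStats)           # rowSds()
    # `se` and calcWindowStat() are assumed to be defined earlier in the session
    # Get tumor methylation base-pair data
    m <- assay(se)[,"tumor"]
    # Compute regions with highest variability across CpGs
    region_stat <- calcWindowStat(m, step=25, window=80, stat=rowSds)
    s <- region_stat[,"stat"]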
Using the epivizr package: browse by regions of interest.
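    # Get locations in decreasing order of the window statistic
    o <- order(s, decreasing=TRUE)
    indices <- region_stat[o, "indices"]
    # Widen each region by 1.25 Mb on both sides, then browse the regions in turn
    slideShowRegions <- rowRanges(se)[indices] + 1250000L
    mgr$slideshow(slideShowRegions)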
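The slides do not show how `calcWindowStat` is defined. As a rough illustration of the sliding-window idea behind it, here is a hypothetical base R stand-in, `window_stat_sketch`, that works on a single numeric vector of per-CpG values and returns a window-center index and a statistic per window. Its name, its return format, and the simulated data are assumptions, and it may differ from the helper used in the talk.

```r
# Hypothetical stand-in for a sliding-window statistic (not the talk's calcWindowStat).
# x: numeric vector ordered by genomic position; window: number of positions per window;
# step: shift between consecutive windows; stat: function applied to each window.
window_stat_sketch <- function(x, step = 25, window = 80, stat = sd) {
  starts <- seq(1, length(x) - window + 1, by = step)
  values <- vapply(starts, function(s) stat(x[s:(s + window - 1)]), numeric(1))
  data.frame(indices = starts + window %/% 2, stat = values)
}

# Simulated methylation-like values with one highly variable stretch (positions 501-600).
set.seed(42)
meth <- c(rnorm(500, 0.5, 0.02), rnorm(100, 0.5, 0.3), rnorm(400, 0.5, 0.02))

res <- window_stat_sketch(meth)
# Rank windows by decreasing variability, mirroring the order() call used with epivizr.
top <- res[order(res$stat, decreasing = TRUE), ]
```

The highest-ranked windows center on the variable stretch, which is the kind of region one would then hand to `mgr$slideshow()` for browsing.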
Our architecture is dynamically extensible. We can easily integrate new data types and add new visualizations.
Example: adding a new visualization
http://epiviz.cbcb.umd.edu/?gist[]=11017650&ws=Y8kWxCO2Ajn
http://epiviz.cbcb.umd.edu/?ws=SRHZlWRRAPd&gist[]=a82a998817564ce3fe48&settings=default&
One interpretation of Big Data: many relevant sources of contextual data
We are building a software system to support creative exploratory analysis of epigenome-wide datasets…

Computed Measurements: create new measurements from integrated measurements and visualize
Summarization on the fly: create new measurements from integrated measurements and visualize
Beyond genomics and epigenomics: metagenomics

Coordinates:

Samples:
Wikum Dinalankara: (formerly) Ph.D. student @ U. Maryland (now postdoc @ Johns Hopkins)
Nick Thieme: Ph.D. student @ U. Maryland
Jeff Leek, Rafael A. Irizarry, Andy Feinberg
Corrada Bravo, et al. (2012) BMC Bioinformatics.
Dinalankara et al. (2015) Cancer Informatics.
Dinalankara et al. In preparation